Abstract

Realistic videos of human actions exhibit rich spatiotemporal
structures at multiple levels of granularity: an action
can always be decomposed into multiple finer-grained elements
in both space and time. To capture this intuition, we
propose to represent videos by a hierarchy of mid-level action
elements (MAEs), where each MAE corresponds to an
action-related spatiotemporal segment in the video. We
introduce an unsupervised method to generate this representation
from videos. Our method is capable of distinguishing
action-related segments from background segments and
representing actions at multiple spatiotemporal resolutions.

Given a set of spatiotemporal segments generated from
the training data, we introduce a discriminative clustering
algorithm that automatically discovers MAEs at multiple
levels of granularity. We develop structured models
that capture a rich set of spatial, temporal and hierarchical
relations among the segments, where the action label and
multiple levels of MAE labels are jointly inferred. The
proposed model achieves state-of-the-art performance in
multiple action recognition benchmarks. Moreover, we
demonstrate the effectiveness of our model in real-world
applications such as action recognition in large-scale untrimmed
videos and action parsing.


1. Introduction 

In this paper we address the problem of learning models
of human actions and using these models for recognizing
and parsing human actions from videos. This is a very
challenging problem. Most human actions are complex
spatial-temporal hierarchical processes. Consider, for
instance, the action in Fig. 2. It is composed of a collection
of spatiotemporal processes ranging from the entire action
sequence, “taking food from fridge”, to simple elementary
actions such as “stretching arm” or “grasping a tomato”.
Each of these actions is often characterized by a complex
distribution of motion segments (e.g. open and close),
objects (e.g. fridge and food), body parts (e.g. arm) along with
their interactions (e.g. grasp a tomato). Thus, in order to
achieve a full understanding of the action that takes place in
a scene, one must recognize and parse this complex structure
of mid-level action elements (MAEs) at different levels
of semantic and spatial-temporal resolution.

Figure 1. A representation of hierarchical spatiotemporal
segments for action. Our method automatically discovers
representative and discriminative mid-level action elements for a
given action class. These elements are encoded in the
spatiotemporal segments which usually cover different aspects of
an action at different levels of granularity, ranging from an
entire action sequence, which comprises the actor along with
the objects the actor interacts with (the first row of the
hierarchy), to the action elements such as fine-grained body
part movements and objects (the last row).

Most of the existing methods cannot do this. A large 
body of work focuses on associating the entire video clip 
with a single class label from a pre-defined set of action cat¬ 
egories (e.g., “take food from fridge” versus “cook food”) 
(Fig. g ED |20l ED - essentially, a video classification 
problem. Methods such as ET] |26l do propose methodolo¬ 
gies for temporally segmenting or parsing the action (e.g., 
“take food from fridge”) into a sequence of sub-action la¬ 
bels (e.g., open fridge, grasp food, close fridge) but can¬ 
not organize these sub-actions into hierarchical structures 
of MAEs such as the one in Fig. Critically, most of 
these methods assume that the fine-grained action labels or 
their temporal structures are pre-defined or hand-specified 
by an expert as opposed to be automatically inferred from 
the videos in a data-driven fashion. This assumption pre¬ 
vents such methods from scaling up to a large number of 
complex actions. Finally, a portion of previous research focuses
on modeling an action by just capturing the spatial-temporal
characteristics of the actor [37, 13, 33], whereby
neither the objects nor the background the actor interacts 
with are used to better contextualize the classification pro¬ 
cess. Other methods El EH [H ED do propose a holistic 
representation for activities which inherently captures some 
degree of background context in the video, but are unable to 
spatially localize or segment the actors or relevant objects. 

In this work, we propose a model that is capable of 
modeling complex actions as a collection of mid-level ac¬ 
tion elements (MAEs) that are organized in a hierarchi¬ 
cal way. Compared to previous approaches, our frame¬ 
work enables: 1) Multi-resolution reasoning - videos can 
be decomposed into a hierarchical structure of spatiotem- 
poral MAEs at multiple scales; 2) Parsing capabilities - 
actions can be described (parsed) as a rich collection of 
spatiotemporal MAEs that capture different characteristics 
of the action ranging from small body motions, objects to 
large pieces of volumes containing person-object interac¬ 
tions. These MAEs can be spatially and temporally local¬ 
ized in the video sequence; 3) Data-driven learning - the 
hierarchical structure of MAEs as well as their labels
do not have to be manually specified, but are learnt and 
discovered automatically using a newly proposed weakly- 
supervised agglomerative clustering procedure. Note that 
some of the MAEs might have clear semantic meanings (see 
Eig. [^, while others might correspond to random but dis¬ 
criminative spatiotemporal segments. In fact, these MAEs 
are learnt so as to establish correspondences between videos 
from the same action class while maximizing their discrim¬ 
inative power for different action classes. Our model has 
achieved state-of-the-art results on multiple action recogni¬ 
tion benchmarks and is capable of recognizing actions from 
large-scale untrimmed video sequences. 

2. Related Work 

The literature on human action recognition is immense;
we refer the reader to the recent survey [?]. In the following,
we only review the work most closely related to ours.

Space-time segment representation: Representing
actions as 2D+t tubes is a common strategy for action
recognition [?]. Recently, there are works that use
hierarchical spatiotemporal segments to capture the multi-scale
characteristics of actions [4, 24]. Our representation differs
in that we can discriminatively discover the mid-level action
elements (MAEs) from a pool of region proposals.

Temporal action localization: While most action
recognition approaches focus on classifying trimmed video
clips [20, 8, 36], there are works that attempt to localize
action instances from long video sequences [?].
In [?], a grammar model is developed for localizing action
and (latent) sub-action instances in the video. Our work
considers a more detailed parsing in both space and time,
and at different semantic resolutions.

Hierarchical structure: Hierarchical structured mod¬ 
els are popular in action recognition due to its capabil¬ 
ity in capturing the multi-level granularity of human ac¬ 
tions (391 [m El 1221 • We follow a similar spirit by rep¬ 
resenting an action as a hierarchy of MAEs. However, most 
previous works focus on classifying single-action video 
clips where they treat these MAEs as latent variables. Our 
method localizes MAEs at both spatial and temporal extent. 

Data-driven action primitives: Action primitives are 
discriminative parts that capture the appearance and motion 
variations of the action EiSEIEZIED. Previous repre¬ 
sentations of action primitives such as interest points (421, 
spatiotemporal patches CD and video snippets (^ typi¬ 
cally lack multiple levels of granularity and structures. In 
this work, we represent action primitives as MAEs, which 
are capable of capturing different aspects of actions rang¬ 
ing from the fine-grained body part segments to the large 
chunks of human-object interactions. A rich set of spatial, 
temporal and hierarchical relations between the MAEs are 
also encoded. Both the MAE labels and the structures of 
MAEs are discovered in a data-driven manner. 

Before diving into details, we first give an overview of
our method. 1) Hierarchical spatiotemporal segmentation.
Given a video, we first develop an algorithm to automatically
parse the video into a hierarchy of spatiotemporal
segments (see Fig. 2). We run this algorithm for each video
independently, and in this way, each video is represented as
a spatiotemporal segmentation tree (Section 3). 2) Learning.
Given a set of spatiotemporal segmentation trees (one
tree per video) in training, we propose a graphical model
that captures the hierarchical dependencies of MAE labels
at different levels of granularity. We consider a weakly
supervised setting, where only the action label is provided
for each training video, while the MAE labels are
discriminatively discovered by clustering the spatiotemporal
segments. The structure of the model is defined by the
spatiotemporal segmentation tree, where inference can be
carried out efficiently (Section 4). 3) Recognition and parsing.
A new video is represented by the spatiotemporal
segmentation tree. We run our learned models on the tree for
recognizing the actions and parsing the videos into MAE labels
at different spatial, temporal and semantic resolutions.

3. Action Proposals: Hierarchical Spatiotemporal Segments

In this section, we describe our method for generating
a hierarchy of action-related spatiotemporal segments from
a video. Our method is unsupervised, i.e. during training,
the spatial locations of the persons and objects are not
annotated. Thus, it is important that our method can
automatically extract the action-related spatiotemporal
segments such as actors, body parts and objects from the video.
Figure 2. Constructing the spatiotemporal segment hierarchy.
(a) Given a video, we first generate action-related region proposals
for each frame. (b) Then, we cluster these proposals to produce a
pool of spatiotemporal segments. (c) The last step is to
agglomeratively cluster the spatiotemporal segments into a hierarchy.


An overview of the method is shown in Fig. 2.

Our method for generating action proposals includes
three major steps.

A. Generating action-related spatial segments. We
initially generate a diverse set of region proposals using
the method of [?]. This method works on a single frame of
video, and returns a large number of segmentation masks
that are likely to contain objects or object parts. We then
score each region proposal using both appearance and
motion cues, and we look for regions that have generic
object-like appearance and distinct motion patterns relative
to their surroundings. We further prune the background
region proposals by training an SVM using the top scored
region proposals as positive examples along with patches
randomly sampled from the background as negative examples.
The region proposals with scores above a threshold (-1) are
considered as action-related spatial segments.

B. Obtaining the spatiotemporal segment pool. Given the
action-related spatial segments for each frame, we seek to
compute “tracklets” of these segments over time to construct
the spatiotemporal segments. We perform spectral clustering
based on the color, shape and space-time distance between
pairs of spatial segments to produce a pool of spatiotemporal
segments. In order to maintain the purity of each
spatiotemporal segment, we set the number of clusters to a
reasonably large number. The pool of spatiotemporal segments
corresponds to the action proposals at the finest scale
(the bottom level of the hierarchy).

C. Constructing the hierarchy. Starting from the initial
set of fine-grained spatiotemporal segments, we
agglomeratively cluster the most similar spatiotemporal
segments into super-spatiotemporal segments until a single
super-spatiotemporal segment is left. In this way, we produce
a hierarchy of spatiotemporal segments for each video
that forms a tree structure, i.e. action proposals at different
levels of granularity. Due to space constraints, we defer the
details of the method to the supplementary material.

4. Hierarchical Models for Action Recognition and Parsing

Figure 3. Graphical illustration of the model. In this example,
we adopt the spatiotemporal hierarchy in Fig. 2(c). The MAE
labels are the red circles. The green circles are the features of each
spatiotemporal segment, and the blue circle is the action label.

So far we have explained how to parse a video into a tree
of spatiotemporal segments. We run this algorithm for each
video independently, and in this way, each video is
represented as a tree of spatiotemporal segments. Our goal is
to assign each of these segments to a label so as to form
a mid-level action element (MAE). We consider a weakly
supervised setting. During training, only the action label is
provided for each video. We discover the MAE labels in an
unsupervised way by introducing a discriminative clustering
algorithm that assigns each spatiotemporal segment to
an MAE label (Section 4.1). In Section 4.2 we introduce
our models for action recognition and parsing, which are
able to capture the hierarchical dependencies of the MAE
labels at different levels of granularity.
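
The agglomerative construction of the segment tree (step C of Section 3) can be illustrated with a minimal sketch. The paper defers the actual similarity measure to the supplementary material, so the Euclidean distance between feature centroids used here is an assumption made purely for illustration, and the function name `build_segment_tree` is hypothetical.

```python
import math
from itertools import combinations


def build_segment_tree(features):
    """Greedy agglomerative clustering of leaf segments into a binary tree.

    features: list of equal-length feature vectors, one per leaf segment.
    Returns a list `parent` where parent[k] is the index of node k's parent
    (-1 for the root). Leaves are 0..n-1; merged nodes get new indices.
    NOTE: centroid Euclidean distance is an illustrative assumption, not
    the similarity used in the paper.
    """
    n = len(features)
    # Each active node stores (centroid, number of leaves below it).
    nodes = {i: (list(f), 1) for i, f in enumerate(features)}
    parent = [-1] * n
    next_id = n
    while len(nodes) > 1:
        # Merge the pair of active nodes whose centroids are closest.
        a, b = min(combinations(sorted(nodes), 2),
                   key=lambda p: math.dist(nodes[p[0]][0], nodes[p[1]][0]))
        (ca, na), (cb, nb) = nodes.pop(a), nodes.pop(b)
        merged = [(x * na + y * nb) / (na + nb) for x, y in zip(ca, cb)]
        nodes[next_id] = (merged, na + nb)
        parent.append(-1)
        parent[a] = parent[b] = next_id
        next_id += 1
    return parent
```

For four leaf segments forming two tight pairs, the two pairs are merged first and the resulting super-segments are merged last, which gives a three-level tree with a single root.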

We start by describing the notation. Given a video $V_n$,
we first parse it into a hierarchy of spatiotemporal segments,
denoted by $\mathcal{V}_n = \{v_i : i = 1, \dots, M_n\}$, following
the procedure introduced in Section 3. We extract features
$X_n$ from these spatiotemporal segments in the form
of $X_n = \{x_i : i = 0, 1, \dots, M_n\}$, where $x_0$ is the root
feature vector, computed by aggregating the feature
descriptors of all spatiotemporal segments in the video, and
$x_i$ ($i = 1, \dots, M_n$) is the feature vector extracted from the
spatiotemporal segment $v_i$ (see Fig. 3).

During training, each video $V_n$ is annotated with an
action label $Y_n \in \mathcal{Y}$, where $\mathcal{Y}$ is the set of all possible
action labels. We denote the MAE labels in the video as
$H_n = \{h_i : i = 1, \dots, M_n\}$, where $h_i \in \mathcal{H}$ is the MAE
label of the spatiotemporal segment $v_i$ and $\mathcal{H}$ is the set of
all possible MAE labels (see Fig. 3). For each training
video, the MAE labels $H_n$ are automatically assigned to
clusters of spatiotemporal segments by our discriminative
clustering algorithm (Section 4.1). The hierarchical
structure above can be compactly described using the notation
$\mathcal{G}_n = (\mathcal{V}_n, \mathcal{E}_n)$, where a vertex $v_i \in \mathcal{V}_n$ denotes a
spatiotemporal segment, and an edge $(v_i, v_j) \in \mathcal{E}_n$ represents
the interaction between a pair of spatiotemporal segments.
In the next section, we describe how to automatically assign
MAE labels to clusters of spatiotemporal segments.

4.1. Discovering Mid-level Action Elements (MAEs) 



Given a set of training videos with action labels, our goal 
is to discover the MAE labels $\mathcal{H}$ by assigning the clusters
of spatiotemporal segments (Section 3) to the correspond¬
ing cluster indices. Consider the example in Eig.[^ the in¬ 
put video is annotated with an action label “take food from 
fridge” in training, and the MAEs should describe the ac¬ 
tion at different resolutions ranging from the fine-grained 
action and object segments (e.g. fridge, tomato, grab) to the 
higher-level human-object interactions (e.g. open fridge, 
close fridge). These MAE labels are not provided in train¬ 
ing, but are automatically discovered by a discriminative 
clustering algorithm on a per-category basis. That means 
the MAEs are discovered by clustering the spatiotemporal 
segments from all the training videos within each action 
class. The MAEs should satisfy two key requirements: 1) 
inclusivity - MAEs should cover all, or at least most, varia¬ 
tions in the appearance and motion of the spatiotemporal 
segments in an action class; 2) discriminability - MAEs 
should be useful to distinguish an action class from others. 

Inspired by the recent success of discriminative clustering
in generating mid-level concepts [38], we develop
a two-step discriminative clustering algorithm to discover
the MAEs. 1) Initialization: we perform an initial clustering
to partition the spatiotemporal segments into a large
number of homogeneous clusters, where each cluster
contains segments that are highly similar in appearance and
shape. 2) Discriminative algorithm: a discriminative
classifier is trained for each cluster independently. Based on the
discriminatively-learned similarity, the visually consistent
clusters will then be merged into mid-level visual patterns
(i.e. MAEs). The discriminative step ensures that
each MAE pattern is different enough from the rest. The
two-step algorithm is explained in detail below.

Figure 4. Visualization of Mid-level Action Elements (MAEs).
The figure shows two clusters (i.e. two MAEs) from the action
category “take out from oven”. Each image shows the first frame of
a spatiotemporal segment and we only visualize five examples in
each MAE. The two clusters capture two different temporal stages
of “take out from oven”. More visualizations are available in the
supplementary material.

Initialization. We run standard spectral clustering on
the feature space of the spatiotemporal segments to obtain
the initial clusters. We define a similarity between every
pair of spatiotemporal segments $v_i$ and $v_j$ extracted from
all of the training videos of the same class:
$K(v_i, v_j) = \exp\big(-d_{bow}(v_i, v_j) - d_{spatial}(v_i, v_j)\big)$,
where $d_{bow}$ is the histogram intersection distance on the BoW
representations of the dense trajectory features [41], and
$d_{spatial}$ denotes the Euclidean distance between the averaged
bounding boxes of spatiotemporal segments $v_i$ and $v_j$ in terms
of four cues: x-y locations, height and width. In order to keep
the purity of each cluster, we set the number of clusters quite
high, producing around 50 clusters per action. We remove clusters
with fewer than 5 spatiotemporal segments.
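
The pairwise similarity used in the initialization step can be sketched as follows. This is a minimal illustration under stated assumptions: the BoW histograms are taken to be L1-normalised so that the histogram intersection distance is one minus the intersection, and the averaged bounding boxes are given as (x, y, width, height) tuples in normalised coordinates. The function names are hypothetical and the paper does not specify these details.

```python
import math


def histogram_intersection_distance(h1, h2):
    """1 - histogram intersection, assuming L1-normalised histograms."""
    return 1.0 - sum(min(a, b) for a, b in zip(h1, h2))


def bbox_distance(b1, b2):
    """Euclidean distance between (x, y, width, height) box descriptors."""
    return math.dist(b1, b2)


def segment_similarity(bow_i, box_i, bow_j, box_j):
    """K(v_i, v_j) = exp(-d_bow(v_i, v_j) - d_spatial(v_i, v_j))."""
    d_bow = histogram_intersection_distance(bow_i, bow_j)
    d_spatial = bbox_distance(box_i, box_j)
    return math.exp(-d_bow - d_spatial)
```

Two segments with identical histograms and boxes get similarity 1, and the similarity decays towards 0 as either the appearance or the spatial layout diverges. The resulting matrix of pairwise similarities is what a spectral clustering routine would consume.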

Discriminative Algorithm. Given an initial set of clusters,
we train a linear SVM classifier for each cluster on the
BoW feature space. We use all spatiotemporal segments
in the cluster as positive examples, and negative examples
are spatiotemporal segments from other action classes. For
each cluster, we run the trained discriminative classifier on
all other clusters in the same action class. We consider
the top K scoring detections of each classifier. We define
the affinity between the initial clusters $C_i$ and $C_j$ as the
frequency with which classifiers i and j fire on the same cluster.

For each action class, we compute the pairwise affinities
between all initial clusters, to obtain the affinity matrix.
Next we perform spectral clustering on the affinity matrix
of each action independently to produce the MAE labels. In
this way, the spatiotemporal segments in the training set are
automatically grouped into clusters in a discriminative way,
where the index of each cluster corresponds to an MAE
label $h_i \in \mathcal{H}$, where $\mathcal{H}$ denotes the set of all possible MAE
labels. We visualize example MAE clusters in Fig. 4.
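
One possible reading of the affinity definition is sketched below: each classifier is run on the segments outside its own cluster, its top K detections are mapped back to the initial clusters they belong to, and the affinity of two classifiers is the number of clusters on which both fire. The exact counting rule is not spelled out in the text, so this interpretation, along with the function name `cluster_affinity`, is an assumption.

```python
def cluster_affinity(scores, segment_cluster, top_k):
    """Affinity between initial clusters from co-firing classifiers.

    scores[i][s]       : score of classifier i on segment s.
    segment_cluster[s] : index of the initial cluster containing segment s.
    Returns a symmetric matrix A where A[i][j] counts the clusters on which
    classifiers i and j both fire among their top_k detections.
    NOTE: this counting rule is an assumed interpretation of the text.
    """
    n = len(scores)
    fired = []
    for i in range(n):
        # Classifier i is only evaluated on segments of the other clusters.
        candidates = [s for s in range(len(segment_cluster))
                      if segment_cluster[s] != i]
        top = sorted(candidates, key=lambda s: scores[i][s], reverse=True)[:top_k]
        fired.append({segment_cluster[s] for s in top})
    return [[len(fired[i] & fired[j]) if i != j else 0 for j in range(n)]
            for i in range(n)]
```

The resulting matrix plays the role of the per-action affinity matrix that is passed to spectral clustering to produce the MAE labels.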

4.2. Model Formulation 

For each video, we have a different tree structure
$\mathcal{G}_n$ obtained from the spatiotemporal segmentation
algorithm (Section 3). Our goal is to jointly model the
compatibility between the input feature vectors $X_n$ and the action
label and MAE labels ($Y_n$ and $H_n$), as well as the
dependencies between pairs of MAE labels. We achieve this by
using the following potential function:

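$$
S_{V_n}(X_n, Y_n, H_n) = \sum_{i \in \mathcal{V}_n} \alpha_{h_i}^{\top} x_i
+ \sum_{i \in \mathcal{V}_n} b_{Y_n, h_i}
+ \sum_{(i,j) \in \mathcal{E}_n} b'_{h_i, h_j}
+ \sum_{(i,j) \in \mathcal{E}_n} \beta_{h_i, h_j}^{\top} d_{ij}
+ \eta_{Y_n}^{\top} x_0 \qquad (1)
$$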

MAE Model $\alpha_{h_i}^{\top} x_i$: This potential captures the
compatibility between the MAE $h_i$ and the feature vector $x_i$ of the
i-th spatiotemporal segment. In our implementation, rather
than using the raw feature directly, we use the output of the
MAE classifier on the feature vector of spatiotemporal
segment i. In order to learn biases between different MAEs,
we append a constant 1 to make $x_i$ 2-dimensional.

Co-occurrence Model $b_{Y_n, h_i}$, $b'_{h_i, h_j}$: This potential
captures the co-occurrence constraints between pairs of
MAE labels. Since the MAEs are discovered on a per-action
basis, we restrict the co-occurrence model to allow for
only action-consistent types: $b_{Y_n, h_i} = 0$ if the MAE $h_i$
is generated from the action class $Y_n$, and $-\infty$ otherwise.
Similarly, $b'_{h_i, h_j} = 0$ if the pair of MAEs $h_i$ and $h_j$ are
generated from the same action class, and $-\infty$ otherwise.

Spatial-Temporal Model $\beta_{h_i, h_j}^{\top} d_{ij}$: This potential
captures the spatiotemporal relations between a pair of MAEs
$h_i$ and $h_j$. In our experiments, we explore a simplified
version of the spatiotemporal model with a reduced set of
structures: $\beta_{h_i, h_j}^{\top} d_{ij} = \beta_{h_i}^{\top}\,\mathrm{bin}_s(i) + \beta'^{\top}_{h_i}\,\mathrm{bin}_t(i, j)$.
The simplification states that the relative spatial and temporal
relation of a spatiotemporal segment i with respect to its parent j is
dependent on the segment type $h_i$, but not its parent type $h_j$.
To compute the spatial feature $\mathrm{bin}_s$, we divide a video frame
into 5x5 cells, and the m-th entry of $\mathrm{bin}_s(i)$ is 1 if the i-th
spatiotemporal segment falls into the m-th cell, otherwise 0.
$\mathrm{bin}_t(i, j)$ is a temporal feature that bins the relative temporal
location of spatiotemporal segments i and j into one of three
canonical relations: before, co-occur, and after. Hence
$\mathrm{bin}_t(i, j)$ is a sparse vector of all zeros with a single one for
the bin occupied by the temporal relation between i and j.
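
The two binned features can be sketched as one-hot vectors. The text does not say how a segment is assigned to a cell or how the three temporal relations are decided, so this sketch assumes that a segment is placed by the centre of its averaged bounding box and that temporal relations are decided by whether the frame spans overlap. Both function names are hypothetical.

```python
def spatial_bin(cx, cy, frame_w, frame_h, grid=5):
    """One-hot vector over grid*grid cells for a segment centre (cx, cy).

    Assumes the segment is assigned to the cell containing its centre.
    """
    col = min(int(cx / frame_w * grid), grid - 1)
    row = min(int(cy / frame_h * grid), grid - 1)
    vec = [0] * (grid * grid)
    vec[row * grid + col] = 1
    return vec


def temporal_bin(span_i, span_j):
    """One-hot vector over (before, co-occur, after) for segment i vs. j.

    Spans are inclusive (start_frame, end_frame) pairs. Deciding the
    relation by span overlap is an illustrative assumption.
    """
    (si, ei), (sj, ej) = span_i, span_j
    if ei < sj:
        return [1, 0, 0]  # i ends before j starts
    if si > ej:
        return [0, 0, 1]  # i starts after j ends
    return [0, 1, 0]      # the two spans overlap
```

With these features, the simplified spatial-temporal potential is a dot product between a per-MAE weight vector and a 25-dimensional or 3-dimensional indicator, so it simply looks up one learned weight per segment and relation.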

Root Model $\eta_{Y_n}^{\top} x_0$: This potential function captures the
compatibility between the global feature $x_0$ of the video $V_n$
and the action class $Y_n$. In our experiments, the global
feature $x_0$ is computed as the aggregation of feature descriptors
of all spatiotemporal segments in the video.

4.3. Inference 

The goal of inference is to predict the hierarchical
labeling for a video, including the action label for the whole
video as well as the MAE labels for spatiotemporal
segments at multiple scales. For a video $V_n$, our inference
corresponds to solving the following optimization problem:
$(Y^*, H^*) = \arg\max_{Y_n, H_n} S_{V_n}(X_n, Y_n, H_n)$. For
the video $V_n$, we jointly infer the action label $Y_n$ of the
video and the MAE labels $H_n$ of the spatiotemporal
segments. The inference on the tree structure is exact, and we
solve it using belief propagation. We emphasize that our
inference returns a parsing of videos including the action
label and the MAE labels at multiple levels of granularity.
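
Exact inference on a tree can be illustrated with a small max-sum (log-domain max-product) message-passing sketch. It is a simplified stand-in rather than the authors' implementation: the pairwise table is shared across all edges, the $-\infty$ co-occurrence terms are realised by restricting each action to its own MAE labels, and all names (`map_inference`, `maes_of_action`, and so on) are hypothetical.

```python
def map_inference(children, unary, pair, root_score, maes_of_action):
    """Jointly pick the action label and MAE labels maximising the score.

    children[i]       : list of children of tree node i (node 0 is the root).
    unary[i][h]       : score of giving MAE label h to node i.
    pair[hc][hp]      : score of child label hc under parent label hp.
    root_score[y]     : root-model score of action y.
    maes_of_action[y] : MAE labels allowed for action y (co-occurrence rule).
    Returns (best_score, best_action, {node: MAE label}).
    """
    best = (float("-inf"), None, None)
    for y, labels in maes_of_action.items():
        labels = list(labels)
        argmax_child = {}  # (child node, parent label) -> best child label

        def up(i):
            """Best subtree score rooted at i, for each label of node i."""
            table = {h: unary[i][h] for h in labels}
            for c in children[i]:
                child = up(c)
                for h in labels:
                    hc = max(labels, key=lambda l: child[l] + pair[l][h])
                    argmax_child[(c, h)] = hc
                    table[h] += child[hc] + pair[hc][h]
            return table

        root = up(0)
        h0 = max(labels, key=lambda l: root[l])
        score = root[h0] + root_score[y]
        if score > best[0]:
            # Backtrack from the root to recover the full labelling.
            assign, stack = {0: h0}, [0]
            while stack:
                i = stack.pop()
                for c in children[i]:
                    assign[c] = argmax_child[(c, assign[i])]
                    stack.append(c)
            best = (score, y, assign)
    return best
```

Because every node has a single parent, each message is computed once per action, so the cost grows linearly with the number of segments and quadratically with the number of MAE labels per action.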

4.4. Learning 

Given a collection of training examples in the form of
$\{X_n, H_n, Y_n\}$, we adopt a structured SVM formulation to
learn the model parameters $w$. In the following, we develop
two learning frameworks for action recognition and parsing
respectively.

Action Recognition. We consider a weakly supervised
setting. For a training video $V_n$, only the action label $Y_n$ is
provided. The MAE labels $H_n$ are automatically discovered
using our discriminative clustering algorithm. We formulate
it as follows:

$$
\min_{w,\, \xi \ge 0} \; \frac{1}{2}\|w\|^2 + \sum_{n} \xi_n
\quad \text{s.t.} \quad
S_{V_n}(X^n, H^n, Y^n) - S_{V_n}(X^n, H^*, Y^*)
\ge \Delta_{0/1}(Y^n, Y^*) - \xi_n, \; \forall n, \qquad (2)
$$

where the loss function $\Delta_{0/1}$ is a standard 0-1
loss that measures the difference between the ground-truth
action label $Y^n$ and the predicted action $Y^*$ for the n-th
video. We use the bundle optimization solver in (SI to solve 
the learning problem. 

Action Parsing. In the real world, a video sequence
is usually not limited to a single action, but may contain
multiple actions at different levels of granularity: some
actions occur in a sequential order; some actions could be
composed of finer-grained MAEs. See Fig. 7 for examples.

The proposed model can naturally be extended to action
parsing. Similar to our action recognition framework,
the first step of action parsing is to construct the
spatiotemporal segment hierarchy for an input video sequence $V_n$,
as shown in Fig. 2. The only difference is that the input
video is not a short video clip, but a long video sequence
composed of multiple action and MAE instances. In training,
we first associate each automatically discovered
spatiotemporal segment with a ground truth action (or MAE)
label. If the spatiotemporal segment contains more than
one ground truth label, we choose the label with the
maximum temporal overlap. We use $Z^n$ to denote the ground
truth action and MAE labels associated with the video $V_n$:
$Z^n = \{z_i : i = 1, \dots, M_n\}$, where $z_i \in \mathcal{Z}$ is the ground
truth action (or MAE) label of the spatiotemporal segment
$v_i$, and $M_n$ is the total number of spatiotemporal segments
discovered from the video. The goal of training is to learn
a model that can parse the input video into a label hierarchy
similar to the ground truth annotation $Z^n$. We formulate it
as follows:


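$$
\min_{w,\, \xi \ge 0} \; \frac{1}{2}\|w\|^2 + \sum_{n} \xi_n
\quad \text{s.t.} \quad
S_{V_n}(X^n, Z^n) - S_{V_n}(X^n, Z^*) \ge \Delta(Z^n, Z^*) - \xi_n, \; \forall n, \qquad (3)
$$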


where $\Delta(Z^n, Z^*)$ is a loss function for action
parsing, which we define as
$\Delta(Z^n, Z^*) = \frac{1}{M_n} \sum_{i=1}^{M_n} \Delta_{0/1}(z_i^n, z_i^*)$,
where $Z^n$ is the ground truth label hierarchy, $Z^*$ is the
predicted label hierarchy and $M_n$ is the total number of
spatiotemporal segments.

Note that the learning framework of action parsing is
similar to Eq. (2), and the only difference lies in the loss
function: we penalize incorrect predictions for every node
of the spatiotemporal segment hierarchy.
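
The two ingredients of the parsing setup, assigning each segment the annotated label with maximum temporal overlap and scoring a predicted hierarchy by its per-node error rate, can be sketched as follows. The data layout (inclusive frame intervals, annotations as `((start, end), label)` pairs) and the function names are assumptions made for illustration only.

```python
def temporal_overlap(a, b):
    """Number of frames shared by two inclusive (start, end) intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def assign_label(segment, annotations):
    """Label of the annotation with maximum temporal overlap, or None.

    annotations: list of ((start, end), label) pairs.
    """
    best = max(annotations,
               key=lambda ann: temporal_overlap(segment, ann[0]),
               default=None)
    if best is None or temporal_overlap(segment, best[0]) == 0:
        return None
    return best[1]


def parsing_loss(true_labels, predicted_labels):
    """Fraction of tree nodes whose predicted label is wrong.

    Implements the per-node 0-1 loss averaged over the M_n segments.
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label hierarchies must cover the same nodes")
    wrong = sum(t != p for t, p in zip(true_labels, predicted_labels))
    return wrong / len(true_labels)
```

Compared with the 0-1 loss on the action label alone, this loss gives partial credit: a parse that gets the action right but mislabels half of the finer-grained nodes incurs a loss of 0.5 rather than 0.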

5. Experiments 

We conduct experiments on both action recognition and 
parsing. We first describe the datasets and experimental set¬ 
tings. We then present our results and compare with the 
state-of-the-art results on these datasets. 

5.1. Experimental Settings and Baselines 

We validate our methods on four challenging benchmark 
datasets, ranging from fine-grained actions (MPI Cooking), 
realistic actions in sports (UCF Sports) and movies (Holly- 
wood2) to untrimmed action videos (THUMOS challenge). 
In the following, we briefly describe the datasets, experi¬ 
mental settings and baselines. 

MPI Cooking dataset [35] is a large-scale dataset of
65 fine-grained actions in cooking. It contains in total 44
video sequences (or equally 5609 video clips, and 881,755
frames), continuously recorded in a kitchen. The dataset is
very challenging in terms of distinguishing between actions
with small inter-class variations, e.g. cut slices and cut dice.
We split the dataset by taking one third of the videos to form
the test set and the rest of the videos are used for training.

UCF-Sports dataset [34] consists of 150 video clips ex¬
tracted from sports broadcasts. Compared to MPI Cooking, 
the scale of UCF-Sports is small and the durations of the 
video clips it contains are short. However, the dataset poses 
many challenges due to large intra-class variations and cam¬ 
era motion. For evaluation, we apply the same train-test 
split as recommended by the authors of CD. 

Hollywood2 dataset is composed of 1,707 video 
clips (823 for training and 884 for testing) with 12 classes 
of human actions. These clips are collected from 69 Holly¬ 
wood movies, divided into 33 training movies and 36 test¬ 
ing movies. In these clips, actions are performed in realistic 
settings with camera motion and great variations. 

THUMOS challenge 2014 (Hi contains over 254 hours 
of temporally untrimmed videos and 25 million frames. We 
follow the settings of the action detection challenge. We 
use 200 untrimmed videos for training and 211 untrimmed 
videos for testing. These videos contain 20 action classes 
and are a subset of the entire THUMOS dataset. We 
consider a weakly supervised setting: in training, each 
untrimmed video is only labeled with the action class that 
the video contains, neither spatial nor temporal annotations 
are provided. Our goal is to evaluate the ability of our model 
in automatically extracting useful mid-level action elements 
(MAEs) and structures from large-scale untrimmed data. 

Baselines. In order to comprehensively evaluate the per¬ 
formance of our method, we use the following baseline 


MPI Cooking          Per-Class
DTF [35, 41]         38.5
root model (ours)    43.2
full model (ours)    48.4

Table 1. Comparison of action recognition accuracies of different
methods on the MPI Cooking dataset.


UCF-Sports            Per-Class    Hollywood2            mAP
Fan et al. [19]       73.1         Gaidon et al. [10]    54.4
Tian et al. [40]      75.2         Oneata et al. [?]     62.4
Raptis et al. [32]    79.4         Jain et al. [?]       62.5
Ma et al. [24]        81.7         Wang et al. [41]      64.3
IDTF [41]             79.2         IDTF [41]             63.0
root model (ours)     80.8         root model (ours)     64.9
full model (ours)     83.6         full model (ours)     66.3

Table 2. Comparison of our results to the state-of-the-art methods
on UCF-Sports and Hollywood2 datasets. Among all of the
methods, [?], [24], [10] and our full model use hierarchical structures.


THUMOS (untrimmed)                     mAP
IDTF [41]                              63.0
sliding window                         63.8
INRIA (temporally supervised) [29]     66.3
full model (ours)                      65.4

Table 3. Comparison of action recognition accuracies of different
methods on the THUMOS challenge (untrimmed videos).

methods. 1) DTF: the first baseline is the dense trajectory
method [41], which has produced state-of-the-art performance
in multiple action recognition benchmarks. 2) IDTF:
the second baseline is the improved dense trajectory feature
proposed in [41], which uses Fisher vectors (FV) [30] to
encode the dense trajectory features. FV encoding [41, 27]
has been shown to improve performance over traditional
Bag-of-Features encoding. 3) root model: the third baseline
is equivalent to our model without the hierarchical structure,
which only uses the IDTF features that fall into the
spatiotemporal segments discovered by our method, while
ignoring those in the background. 4) sliding window: the
fourth baseline runs sliding windows of different lengths
and step sizes on an input video sequence, and performs
non-maximum suppression to find the correct intervals of
an action. This baseline is only applied to action recognition
of untrimmed videos and action parsing.
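
The sliding window baseline can be sketched as window enumeration followed by temporal non-maximum suppression. The window lengths, stride ratio and overlap threshold below are illustrative assumptions, since the text does not report the values used, and the per-window scores are assumed to come from an action classifier that is not shown here.

```python
def sliding_windows(num_frames, lengths, stride_ratio=0.5):
    """Enumerate (start, end) windows, end exclusive, for several lengths."""
    windows = []
    for length in lengths:
        step = max(1, int(length * stride_ratio))
        for start in range(0, max(1, num_frames - length + 1), step):
            windows.append((start, min(start + length, num_frames)))
    return windows


def temporal_iou(a, b):
    """Intersection over union of two (start, end) intervals, end exclusive."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def temporal_nms(windows, scores, iou_threshold=0.5):
    """Keep high-scoring windows, dropping any that overlap a kept one."""
    order = sorted(range(len(windows)), key=lambda k: scores[k], reverse=True)
    kept = []
    for k in order:
        if all(temporal_iou(windows[k], windows[j]) <= iou_threshold
               for j in kept):
            kept.append(k)
    return [windows[k] for k in kept]
```

Unlike the hierarchical model, this baseline has to commit to a fixed set of window lengths in advance, which is one reason it can miss actions whose duration falls between the chosen scales.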

5.2. Experimental Results 

We summarize the action recognition results on multiple
benchmark datasets in Tables 1, 2 and 3, respectively.

Action recognition. Most existing action recognition
benchmarks are composed of video clips that have been
trimmed according to the action of interest. On all three
benchmarks (i.e. UCF-Sports, Hollywood2 and MPI Cooking),
our full model with rich hierarchical structures
outperforms our own baseline root model (i.e. our model
without hierarchical structures), which only considers the
dense trajectories extracted from the spatiotemporal
segments discovered by our method, by 2.8, 1.4 and 5.2 points
respectively. We can also observe that the root model
consistently improves over dense trajectories [41]
on all three datasets. This demonstrates that our
automatically discovered MAEs fire on the action-related regions
and thus remove the irrelevant background trajectories.

We also compare our method with the most recent results
reported in the literature for UCF-Sports and Hollywood2.
On UCF-Sports, all presented results follow the same
train-test split. The baseline IDTF [41] is among the top
performers. Ma et al. [24] reported 81.7% by using a bag
of hierarchical space-time segments representation. We
improve on their result by around 2%. On Hollywood2,
our method also achieves state-of-the-art performance. The
previous best result is from [41]; we improve on it
by 2%. Compared to previous methods, our method is
weakly supervised and requires neither expensive bounding
box annotations in training (e.g. [13, 40, 32]) nor human
detection as input.

On the THUMOS challenge, which is composed of realistic
untrimmed videos (see the corresponding results table), our
method outperforms both IDTF and the sliding window baseline.
Given the scale of the dataset, we skip the time-consuming
spatial region proposals and represent an action as a
hierarchy of temporal segments, i.e. each frame is regarded
as a “spatial segment”. Our method automatically identifies
the temporal segments that are both representative and
discriminative for each action class without any temporal
annotation of actions in training. We also compare our
method with the best submission (INRIA) of the temporal
action localization challenge in THUMOS 2014. INRIA uses
a mixture of IDTF [41], SIFT [23], color features and CNN
features. Moreover, their model is temporally supervised:
it uses temporal annotations (the start and end frames
of actions in untrimmed videos) and additional background
videos in training. Our method achieves competitive
performance (within 1%) using only IDTF [41] and does not
require any temporal supervision. We provide the average
precisions (AP) of all 20 action classes in Fig. 5. Our
method outperforms INRIA in 10 out of the 20 classes,
especially Diving and CleanAndJerk, which contain rich
structures and significant intra-class variations.

Figure 5. Average precisions of the 20 action classes of untrimmed
videos from the temporal localization challenge in THUMOS.

Figure 6. Action parsing performance (temporal localization
accuracy). We report the mean Average Precision (mAP) of our
method and the sliding window baseline on MPI Cooking with
respect to different overlap thresholds that determine whether
an action (or MAE) segment is correctly localized.

Action parsing. Given a video sequence that contains
multiple action and MAE instances, our goal is to localize
each one of them. During training, we therefore assume that
all of the action and MAE labels as well as their temporal
extents are provided. This is different from action recognition,
where all of the MAEs are unsupervised. We evaluate the
ability of our method to perform action parsing by measuring
the accuracy in temporally localizing all of the action
and MAE instances. An action (or MAE) segment is considered
a true positive if it overlaps with the ground truth
segment beyond a pre-defined threshold. We evaluate the
mean Average Precision (mAP) with the overlap threshold
varying from 0.1 to 0.5.
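
The localization criterion above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: overlap is measured by temporal intersection-over-union, each ground truth segment may be matched by at most one detection, and AP is computed without interpolation. The function names and the (start, end, score) detection format are hypothetical; the exact evaluation protocol may differ.

```python
from typing import List, Tuple

Interval = Tuple[float, float]
Detection = Tuple[float, float, float]  # (start, end, confidence score)


def temporal_iou(a: Interval, b: Interval) -> float:
    """Intersection-over-union of two temporal intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(detections: List[Detection],
                      ground_truth: List[Interval],
                      iou_threshold: float) -> float:
    """AP for one label: a detection is a true positive if its IoU with a
    not-yet-matched ground truth segment exceeds the threshold."""
    if not ground_truth:
        return 0.0
    matched = set()
    true_positives = 0
    precision_sum = 0.0
    ranked = sorted(detections, key=lambda d: d[2], reverse=True)
    for rank, (start, end, _score) in enumerate(ranked, start=1):
        best_iou, best_idx = 0.0, None
        for idx, gt in enumerate(ground_truth):
            if idx in matched:
                continue
            iou = temporal_iou((start, end), gt)
            if iou > best_iou:
                best_iou, best_idx = iou, idx
        if best_idx is not None and best_iou > iou_threshold:
            matched.add(best_idx)
            true_positives += 1
            precision_sum += true_positives / rank
    return precision_sum / len(ground_truth)


def ap_per_threshold(detections: List[Detection],
                     ground_truth: List[Interval],
                     thresholds=(0.1, 0.2, 0.3, 0.4, 0.5)) -> List[float]:
    """Evaluate AP at each overlap threshold, as in a threshold sweep."""
    return [average_precision(detections, ground_truth, t) for t in thresholds]
```

Averaging `average_precision` over all labels at a fixed threshold gives the mAP for that threshold; repeating this for each threshold yields one point per threshold on a curve such as the one in Fig. 6.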

We use the original fine-grained action labels provided
in the MPI Cooking dataset as the MAEs at the bottom
level of the hierarchy, and automatically generate a set of
higher-level labels by composing the fine-grained action
labels. The detailed setup is explained in the supplementary
material. Examples of higher-level action labels are “cut
apart - put in bowl” and “screw open - spice - screw close”.
We only consider labels whose length ranges from 1 to 4
and that occur more than 10 times in the training set. In this
way, we obtain a total of 120 action and MAE labels for the
parsing evaluation.
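
One plausible way to generate such composed labels is sketched below in Python. It assumes that each training video is given as an ordered list of fine-grained labels and that a higher-level label is a contiguous subsequence of them; since the actual procedure is described only in the supplementary material, this is an illustrative assumption, and `compose_labels` is a hypothetical helper.

```python
from collections import Counter
from typing import Dict, List, Tuple


def compose_labels(videos: List[List[str]],
                   max_len: int = 4,
                   min_count: int = 10) -> Dict[Tuple[str, ...], int]:
    """Count contiguous label sequences of length 1..max_len and keep
    those that occur more than min_count times in the training set."""
    counts: Counter = Counter()
    for labels in videos:
        for n in range(1, max_len + 1):
            for i in range(len(labels) - n + 1):
                counts[tuple(labels[i:i + n])] += 1
    return {seq: c for seq, c in counts.items() if c > min_count}
```

With this sketch, a sequence such as `("cut apart", "put in bowl")` is retained as a higher-level label only if it appears more than 10 times across the training videos.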

We compare our results with the sliding window baseline.
The curves are shown in Fig. 6. Our method shows consistent
improvement over the baseline across the different overlap
thresholds. If we consider an action segment correctly
localized when its “intersection-over-union” score is larger than
0.5 (the PASCAL VOC criterion), our method outperforms
the baseline by 8.5%. The mean performance gap (averaged
over all overlap thresholds) between our method
and the baseline is 8.6%. Some visualizations of action
parsing results are shown in Fig. 7. As we can see, the story
of human actions is more than just the actor: as shown in
the figure, the automatically discovered MAEs cover different
aspects of an action, ranging from the human body and
its parts to spatiotemporal segments that are not directly
related to humans but carry significant discriminative power
(e.g. a piece of fridge segment for the action “take out from
fridge”). This diverse set of mid-level visual patterns is
then organized in a hierarchical way to explain the complex
story of the video at different levels of granularity.

Figure 7. Action parsing. This figure shows the output of our action parser for two test videos. Panel (b): a test video containing
“spice”, “screw close” and “put in spice holder” actions. For each video, we visualize the inferred
fine-grained action labels (shown on top of each image), the MAE segments (the red masks in each image) and the parent-child relations
(the orange line). As we can see, our action parser is able to parse long video sequences into representative action patterns (i.e. MAEs) at
multiple scales. Note that the figure only includes a few representative nodes of the entire tree obtained by our parser; we provide more
visualizations in the supplementary material.


6. Conclusion 

We have presented a hierarchical mid-level action element
(MAE) representation for action recognition and parsing
in videos. We consider a weakly supervised setting,
where only the action labels are provided in training. Our
method automatically parses an input video into a hierarchy
of MAEs at multiple scales, where each MAE defines an
action-related spatiotemporal segment in the video. We
develop structured models to capture the rich semantic meanings
carried by these MAEs, as well as the spatial, temporal
and hierarchical relations among them. In this way,
the action and MAE labels at different levels of granularity
are jointly inferred. Our experimental results demonstrate
encouraging performance over a number of standard baseline
approaches as well as other reported results on several
benchmark datasets.